Abstract. Metric learning methods have been shown to perform well on
different learning tasks. Many of them rely on target neighborhood
relationships that are computed in the original feature space and remain
fixed throughout learning. As a result, the learned metric reflects the
original neighborhood relations. We propose a novel formulation of the
metric learning problem in which, in addition to the metric, the target
neighborhood relations are also learned in a two-step iterative approach.
The new formulation can be seen as a generalization of many existing
metric learning methods. The formulation includes a target neighbor
assignment rule that assigns different numbers of neighbors to instances
according to their quality; 'high quality' instances get more neighbors.
We experiment with two of its instantiations that correspond to the
metric learning algorithms LMNN and MCML and compare them to other
metric learning methods on a number of datasets. The experimental results
show state-of-the-art performance and provide evidence that learning the
neighborhood relations does improve predictive performance.

Keywords: Metric Learning, Neighborhood Learning 



1 Introduction 

The choice of an appropriate distance metric plays an important role in
distance-based algorithms such as k-NN and k-Means clustering. The Euclidean
metric is often the metric of choice; however, it may easily decrease the
performance of these algorithms since it relies on the simple assumption that
all features are equally informative. Metric learning is an effective way to
overcome this limitation by learning the importance of different features,
exploiting prior knowledge that comes in different forms. The most well studied
metric learning paradigm is that of learning the Mahalanobis metric, with a
steadily expanding literature over the last years [19,13,3,2,10,18,9,5,16].

Metric learning for classification relies on two interrelated concepts:
similarity and dissimilarity constraints, and the target neighborhood. The
latter defines for any given instance the instances that should be its
neighbors, and it is specified using similarity and dissimilarity constraints.
In the absence of any other prior knowledge the similarity and dissimilarity
constraints are derived from the class labels; instances of the same class
should be similar and instances of different classes should be dissimilar.

The target neighborhood can be constructed in a global or local manner. With
a global target neighborhood all constraints over all instance pairs are active;
all instances of the same class should be similar and all instances from
different classes should be dissimilar [19,3]. These admittedly hard to achieve
constraints can be relaxed with the incorporation of slack variables
[13,2,10,9]. With a local target neighborhood the satisfiability of the
constraints is examined within a local neighborhood [4,17,10,15]. For any given
instance we only need to ensure that we satisfy the constraints that involve
that instance and instances from its local neighborhood. The resulting problem
is considerably less constrained than what we get with the global approach and
easier to solve. However, the appropriate definition of the local target
neighborhood now becomes a critical component of the metric learning algorithm,
since it determines which constraints will be considered in the learning
process. LMNN [18] defines the local target neighborhood of an instance as its
k same-class nearest neighbors under the Euclidean metric in the original
space. Goldberger et al. [4] initialize the target neighborhood for each
instance to all same-class instances. The local neighborhood is encoded as a
soft-max function of a linear projection matrix and changes as a result of the
metric learning. With the exception of [4], all other approaches, whether
global or local, establish a target neighborhood prior to learning and keep it
fixed throughout the learning process. Thus the metric that will be learned
from these fixed neighborhood relations is constrained by them and will be a
reflection of them. However, these relations are not necessarily optimal with
respect to the learning problem that one is addressing.

In this paper we propose a novel formulation of the metric learning problem
that includes in the learning process the learning of the local target
neighborhood relations. The formulation is based on the fact that many metric
learning algorithms can be seen as directly maximizing the sum of some quality
measure of the target neighbor relationships under an explicit parametrization
of the target neighborhoods. We cast the process of learning the neighborhood
as a linear programming problem with a totally unimodular constraint matrix
[14]. An integer 0-1 solution of the target neighbor relationship is guaranteed
by the totally unimodular constraint matrix. The number of target neighbors
does not need to be fixed: the formulation allows the assignment of a different
number of target neighbors to each learning instance according to the
instance's quality. We propose a two-step iterative optimization algorithm that
learns the target neighborhood relationships and the distance metric. The
proposed neighborhood learning method can be coupled with standard metric
learning methods to learn the distance metric, as long as these can be cast as
instances of our formulation.

We experiment with two instantiations of our approach where the Large Margin
Nearest Neighbor (LMNN) [18] and Maximally Collapsing Metric Learning (MCML)
[3] algorithms are used to learn the metric; we dub the respective
instantiations LN-LMNN and LN-MCML. We performed a series of experiments on a
number of classification problems in order to determine whether learning the
neighborhood relations improves over only learning the distance metric. The
experimental results show that this is indeed the case. In addition, we also
compared our method with other state-of-the-art metric learning methods and
show that it improves over the current state-of-the-art performance.

The paper is organized as follows. In Section 2 we discuss the related work in
more detail. In Section 3 we present the optimization problem of the Learning
Neighborhoods for Metric Learning algorithm (LNML) and in Section 4 we discuss
the properties of LNML. In Section 5 we instantiate our neighborhood learning
method on LMNN and MCML. In Section 6 we present the experimental results and
we finally conclude with Section 7.

2 Related Work 

The early work of Xing et al. [19] learns a Mahalanobis distance metric for
clustering that tries to minimize the sum of pairwise distances between similar
instances while keeping the sum of dissimilar instance distances greater than a
threshold. The similar and dissimilar pairs are determined on the basis of
prior knowledge. Globerson & Roweis [3] introduced Maximally Collapsing Metric
Learning (MCML). MCML uses a stochastic nearest neighbor selection rule which
selects the nearest neighbor $x_j$ of an instance $x_i$ under some probability
distribution. It casts the metric learning problem as an optimization problem
that tries to minimize the distance between two probability distributions, an
ideal one and a data dependent one. In the ideal distribution the selection
probability is always one for instances of the same class and zero for
instances of a different class, defining in that manner the similarity and
dissimilarity constraints under the global target neighborhood approach. In the
data dependent distribution the selection probability is given by a soft-max
function of a Mahalanobis distance metric, parametrized by the matrix $M$ to be
learned. In a similar spirit Davis et al. [2] introduced Information-Theoretic
Metric Learning (ITML). ITML learns a Mahalanobis metric for classification
with similarity and dissimilarity constraints that follow the global target
neighborhood approach. In ITML all same-class instance pairs should have a
distance smaller than some threshold and all different-class instance pairs
should have a distance larger than some threshold. In addition, the objective
function of ITML seeks to minimize the distance between the learned metric
matrix and a prior metric matrix, modelling in that way prior knowledge about
the metric if such knowledge is available. The optimization problem is cast as
a distance of distributions subject to the pairwise constraints and finally
expressed as a Bregman optimization problem (minimizing the LogDet divergence).
In order to be able to find a feasible solution they introduce slack variables
in the similarity and dissimilarity constraints.

The metric learning methods discussed so far follow the global target
neighborhood approach, in which all instances of the same class should be
similar under the learned metric and all pairs of instances from different
classes dissimilar. This is a rather hard constraint and assumes that there is
a linear projection of the original feature space that results in unimodal
class conditional distributions. Goldberger et al. [4] proposed the NCA metric
learning method, which uses the same stochastic nearest neighbor selection rule
under the same data-dependent probability distribution as MCML. NCA seeks to
minimize the soft error under its stochastic nearest neighbor selection rule.
It uses only similarity constraints and the original target neighborhood of an
instance is the set of all same-class instances. After metric learning some,
but not necessarily all, same-class instances will end up having a high
probability of being selected as nearest neighbors of a given instance, thus
having a small distance, while the others will be pushed further away. NCA thus
learns the local target neighborhood as a part of the optimization.
Nevertheless it is prone to overfitting [20] and does not scale to large
datasets. The large margin nearest neighbor method (LMNN) described in [17,18]
learns a distance metric which directly minimizes the distances of each
instance to its local target neighbors while keeping a large margin between
them and different-class instances. The target neighbors have to be specified
prior to metric learning and in the absence of prior knowledge these are the k
same-class nearest neighbors of each instance.
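
As an illustration of the default LMNN target neighbors mentioned above, the following minimal Python sketch selects, for each instance, its k nearest same-class instances under the Euclidean metric. The function names and the toy data are assumptions made for this example only.

```python
def euclidean_sq(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def initial_target_neighbors(X, y, k):
    """For each instance, return the indices of its k nearest same-class instances."""
    neighbors = []
    for i, xi in enumerate(X):
        # Candidates are all other instances that share the class label of x_i.
        same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        same.sort(key=lambda j: euclidean_sq(xi, X[j]))
        neighbors.append(same[:k])
    return neighbors


# Hypothetical toy data: three instances of class 0 and two of class 1.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 5.2]]
y = [0, 0, 0, 1, 1]
print(initial_target_neighbors(X, y, k=1))
```

These neighbor lists are computed once in the original space and never revisited, which is the fixed-neighborhood behaviour that the method of this paper relaxes.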

3 Learning Target Neighborhoods for Metric Learning 

Given a set of training instances $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
where $x_i \in \mathbb{R}^d$ and the class labels $y_i \in \{1, 2, \ldots, c\}$,
the Mahalanobis distance between two instances $x_i$ and $x_j$ is defined as:

$$D_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j) \tag{1}$$

where $M$ is a Positive Semi-Definite (PSD) matrix ($M \succeq 0$) that we will learn.
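
The distance of equation (1) can be evaluated with a few lines of code. The sketch below is a plain-Python illustration; the function name and the example matrices are assumptions, and $M$ is passed as a list of rows.

```python
def mahalanobis_sq(xi, xj, M):
    """Return (x_i - x_j)^T M (x_i - x_j) for a square matrix M given as a list of rows."""
    d = [a - b for a, b in zip(xi, xj)]
    # Md is the matrix-vector product M (x_i - x_j).
    Md = [sum(m * dk for m, dk in zip(row, d)) for row in M]
    return sum(dk * mk for dk, mk in zip(d, Md))


identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the squared Euclidean distance
weighted = [[2.0, 0.0], [0.0, 0.5]]   # stresses feature 1, damps feature 2

print(mahalanobis_sq([1.0, 2.0], [0.0, 0.0], identity))  # 5.0
print(mahalanobis_sq([1.0, 2.0], [0.0, 0.0], weighted))  # 4.0
```

With the identity matrix all features count equally; a learned $M$ reweights and correlates them.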

We can reformulate many of the existing metric learning methods, such as
[19,13,3,10,18], by explicitly parametrizing the target neighborhood relations
as follows:

$$
\begin{aligned}
\min_{M,\Xi} \quad & \sum_{ij,\; i \neq j,\; y_i = y_j} P_{ij} \cdot F_{ij}(M, \Xi) \\
\text{s.t.} \quad & \text{constraints of the original problem}
\end{aligned}
\tag{2}
$$

The matrix $P$, $P_{ij} \in \{0,1\}$, describes the target neighbor
relationships which are established prior to metric learning and are not
altered in these methods. $P_{ij} = 1$ if $x_j$ is the target neighbor of
$x_i$, otherwise $P_{ij} = 0$. Note that the parameters $P_{ii}$ and
$P_{ij} : y_i \neq y_j$ are set to zero, since an instance $x_i$ cannot be a
target neighbor of itself and the target neighbor relationship is constrained
to same-class instances. This is why we have $i \neq j$, $y_i = y_j$ in the
sum; however, for simplicity we will drop it from the following equations.
$F_{ij}(M, \Xi)$ is the term of the objective function of the metric learning
methods that relates to the target neighbor relationship $P_{ij}$, $M$ is the
Mahalanobis metric that we want to learn, and $\Xi$ is a set of other
parameters in the original metric learning problems, e.g. slack variables.
Regularization terms on the $M$ and $\Xi$ parameters can also be added into
Problem 2 [HTDJ.

The $F_{ij}(M, \Xi)$ term can be seen as the 'quality' of the target neighbor
relationship under the distance metric $M$; a low value indicates a high
quality neighbor relationship $P_{ij}$. The different metric learning methods
learn the $M$ matrix that optimizes the sum of the quality terms based on the
a priori established target neighbor relationships; however, there is no reason
to believe that these target relationships are the most appropriate for
learning.

To overcome the constraints imposed by the fixed target neighbors we propose
the Learning the Neighborhood for Metric Learning method (LNML) in which, in
addition to the metric matrix $M$, we also learn the target neighborhood matrix
$P$. LNML has as objective function the one given in Problem 2, which we now
optimize also over the target neighborhood matrix $P$. We add some new
constraints to Problem 2 which control the size of the target neighborhoods.
The new optimization problem is the following:

$$
\begin{aligned}
\min_{M,\Xi,P} \quad & \sum_{ij} P_{ij} \cdot F_{ij}(M, \Xi) \\
\text{s.t.} \quad & \sum_{i,j} P_{ij} = K_{av} \cdot n \\
& K_{max} \geq \sum_{j} P_{ij} \geq K_{min} \\
& 1 \geq P_{ij} \geq 0 \\
& \text{constraints of the original problem}
\end{aligned}
\tag{3}
$$

$K_{min}$ and $K_{max}$ are the minimum and maximum numbers of target neighbors
that an instance can have. Thus the second constraint controls the number of
target neighbors that instance $x_i$ can have. $K_{av}$ is the average number
of target neighbors per instance. It holds by construction that
$K_{max} \geq K_{av} \geq K_{min}$. We should note here that we relax the
target neighborhood matrix so that its elements $P_{ij}$ take values in
$[0, 1]$ (third constraint). However, we will show later that a solution
$P_{ij} \in \{0, 1\}$ is obtained, given some natural constraints on the
$K_{min}$, $K_{max}$ and $K_{av}$ parameters.

3.1 Target neighbor assignment rule

Unlike other metric learning methods, e.g. LMNN, in which the number of target
neighbors is fixed, LNML can assign a different number of target neighbors to
each instance. As we saw, the first constraint in Problem 3 sets the average
number of target neighbors per instance to $K_{av}$, while the second
constraint limits the number of target neighbors of each instance to between
$K_{min}$ and $K_{max}$. The above optimization problem implements a target
neighbor assignment rule which assigns more target neighbors to instances that
have high quality target neighbor relations. We do so in order to avoid
overfitting, since most often the 'good' quality instances defined by metric
learning algorithms [3,18] are instances in dense areas with low classification
error. As a result the geometry of the dense areas of the different classes
will be emphasized. How much emphasis we give to good quality instances depends
on the actual values of $K_{min}$ and $K_{max}$. In the limit one can set the
value of $K_{min}$ to zero; nevertheless the risk with such a strategy is to
focus heavily on dense and easy to learn regions of the data and ignore
important boundary instances that are useful for learning.



4 Optimization 

4.1 Properties of the Optimization Problem 

We will now show that we get integer solutions for the $P$ matrix by solving a
linear programming problem, and analyze the properties of Problem 3.

Lemma 1. Given $M$, $\Xi$, and $K_{max} \geq K_{av} \geq K_{min}$, then
$P_{ij} \in \{0, 1\}$ if $K_{min}$, $K_{max}$ and $K_{av}$ are integers.

Proof. Given $M$ and $\Xi$, $F_{ij}(M, \Xi)$ becomes a constant. We denote by
$p$ the vectorization of the target neighborhood matrix $P$ which excludes the
diagonal elements and $P_{ij} : y_i \neq y_j$, and by $f$ the respective
vectorized version of the $F_{ij}$ terms. Then we rewrite Problem 3 as:

$$
\begin{aligned}
\min_{p} \quad & p^T f \\
\text{s.t.} \quad & (\underbrace{K_{max}, \ldots, K_{max}}_{n}, K_{av} \cdot n)^T \geq A p \geq (\underbrace{K_{min}, \ldots, K_{min}}_{n}, K_{av} \cdot n)^T \\
& 1 \geq p_i \geq 0
\end{aligned}
\tag{4}
$$

The first and second constraints of Problem 3 are reformulated as the first
constraint in Problem 4. $A$ is an $(n + 1) \times (\sum_{c} n_c^2 - n)$
constraint matrix, where $n_c$ is the number of instances in class $c$:

$$
A = \begin{bmatrix}
\mathbf{1} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{1} & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{1} \\
\mathbf{1} & \mathbf{1} & \cdots & \mathbf{1}
\end{bmatrix}
$$

where $\mathbf{1}$ ($\mathbf{0}$) is the vector of ones (zeros). The number of
its elements depends on its position in the matrix $A$. In the $i$th column,
all $\mathbf{1}$ ($\mathbf{0}$) vectors have $n_i - 1$ elements, where $n_i$ is
the number of instances of class $c_j$ with $c_j = y_i$. According to the
sufficient condition for total unimodularity (Theorem 7.3 in [T3]) the
constraint matrix $A$ is a totally unimodular matrix. Thus, the constraint
matrix $B = [I, -I, A, -A]^T$ in the following equivalent problem is also a
totally unimodular matrix (pp. 268):

$$
\begin{aligned}
\min_{p} \quad & p^T f \\
\text{s.t.} \quad & B p \leq e \\
& e = (\underbrace{1, \ldots, 1}_{\sum_c n_c^2 - n}, \underbrace{0, \ldots, 0}_{\sum_c n_c^2 - n}, \underbrace{K_{max}, \ldots, K_{max}}_{n}, K_{av} \cdot n, \underbrace{-K_{min}, \ldots, -K_{min}}_{n}, -K_{av} \cdot n)^T
\end{aligned}
\tag{5}
$$

Since $e$ is an integer vector, provided $K_{min}$, $K_{max}$ and $K_{av}$ are
integers, and the constraint matrix $B$ is totally unimodular, the above linear
programming problem will only have integer solutions (Theorem 19.1a in [12]).
Therefore, for the solution $p$ it will hold that $p_i \in \{0, 1\}$ and
consequently $P_{ij} \in \{0, 1\}$.
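
To make the neighbor assignment step concrete, here is a minimal Python sketch that computes a 0-1 matrix $P$ for fixed quality values $F_{ij}$. It is not the linear program itself and does not call an LP solver: it is a greedy procedure swapped in for illustration, which relies on the assumption that, because the objective is separable and the only coupling is the total count, giving every instance its $K_{min}$ cheapest same-class neighbors and then filling the remaining budget with the globally cheapest candidates (capped at $K_{max}$ per instance) should reach the same optimum. The function name and toy values are hypothetical.

```python
import heapq


def assign_target_neighbors(F, y, k_min, k_max, k_av):
    """Greedy 0-1 assignment: K_min <= row sum <= K_max, total = K_av * n.

    F[i][j] is the (fixed) quality term of making x_j a target neighbor of x_i;
    lower is better. Only same-class pairs with i != j are candidates.
    """
    n = len(F)
    P = [[0] * n for _ in range(n)]
    ranked = []
    for i in range(n):
        cand = sorted((j for j in range(n) if j != i and y[j] == y[i]),
                      key=lambda j: F[i][j])
        if len(cand) < k_min:
            raise ValueError("instance %d has too few same-class candidates" % i)
        ranked.append(cand)
        for j in cand[:k_min]:          # mandatory K_min neighbors per instance
            P[i][j] = 1
    remaining = k_av * n - k_min * n    # budget left after the mandatory part
    # Heap entries: (cost, instance, position of its next unused candidate).
    heap = [(F[i][ranked[i][k_min]], i, k_min)
            for i in range(n) if k_min < min(k_max, len(ranked[i]))]
    heapq.heapify(heap)
    while remaining > 0:
        if not heap:
            raise ValueError("constraints are infeasible for this data")
        _, i, pos = heapq.heappop(heap)
        P[i][ranked[i][pos]] = 1
        remaining -= 1
        if pos + 1 < min(k_max, len(ranked[i])):
            heapq.heappush(heap, (F[i][ranked[i][pos + 1]], i, pos + 1))
    return P


# Hypothetical quality values: instances 0 and 1 have cheap (high quality) relations.
F = [[0, 1, 2, 3],
     [1, 0, 2, 3],
     [8, 9, 0, 7],
     [9, 8, 7, 0]]
y = [0, 0, 0, 0]
P = assign_target_neighbors(F, y, k_min=1, k_max=3, k_av=2)
print([sum(row) for row in P])  # [3, 3, 1, 1]
```

In this toy run the two high quality instances receive three target neighbors each while the other two keep only the mandatory one, which is the behaviour of the assignment rule of Section 3.1.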

Although the constraints that control the size of the target neighborhood are
convex, the objective function in Problem 3 is not jointly convex in $P$ and
$(M, \Xi)$. However, as shown in Lemma 1, the binary solution of $P$ can be
obtained by a simple linear program if we fix $(M, \Xi)$. Thus, Problem 3 is
individually convex in $P$ and $(M, \Xi)$ if the original metric learning
method is convex; this condition holds for all the methods that can be coupled
with our neighborhood learning method [19,13,3,10,18].



4.2 Optimization Algorithm 

Based on Lemma 1 and the individual convexity property we propose a general and
easy to implement iterative algorithm to solve Problem 3. The details are given
in Algorithm 1. At the first step of the $k$th iteration we learn the binary
target neighborhood matrix $P^{(k)}$ under a fixed metric matrix $M^{(k-1)}$
and $\Xi^{(k-1)}$, learned in the $(k-1)$th iteration, by solving the linear
programming problem described in Lemma 1. At the second step of the iteration
we learn the metric matrix $M^{(k)}$ and $\Xi^{(k)}$ with the target
neighborhood matrix $P^{(k)}$, using $M^{(k-1)}$ as the initial metric matrix.
The second step is simply the application of a standard metric learning
algorithm in which we set the target neighborhood matrix to the learned
$P^{(k)}$ and the initial metric matrix to $M^{(k-1)}$. The convergence of the
proposed algorithm is guaranteed if the original metric learning problem is
convex [1]. In our experiments, it most often converges in 5-10 iterations.



5 Instantiating LNML 

In this section we will show how we instantiate our neighborhood learning
method with two standard metric learning methods, LMNN and MCML; other possible
instantiations include the metric learning methods presented in [19,13,10].

Algorithm 1 LNML

Input: $X$, $Y$, $M^{(0)}$, $\Xi^{(0)}$, $K_{min}$, $K_{max}$ and $K_{av}$
Output: $M$
repeat
    $P^{(k)}$ = LearningNeighborhood($X$, $Y$, $M^{(k-1)}$, $\Xi^{(k-1)}$) by solving Problem 4
    ($M^{(k)}$, $\Xi^{(k)}$) = MetricLearning($M^{(k-1)}$, $P^{(k)}$)
    $k := k + 1$
until convergence
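
The alternating structure of Algorithm 1 can be sketched as a small driver loop. The Python code below is an illustration under stated assumptions: the two steps are passed in as callables (`learn_neighborhood` and `learn_metric` are hypothetical names), the extra parameters $\Xi$ are omitted for brevity, and the stopping rule (the neighborhood matrix no longer changes, or an iteration cap is reached) is an assumption, since the algorithm only states "until convergence".

```python
def lnml(X, y, M0, learn_neighborhood, learn_metric, max_iter=20):
    """Alternate between neighborhood learning and metric learning.

    learn_neighborhood(X, y, M) -> P   step 1: target neighbors for a fixed metric
    learn_metric(X, y, M, P)    -> M   step 2: metric learner warm-started at M
    """
    M, P_prev = M0, None
    P = None
    for _ in range(max_iter):
        P = learn_neighborhood(X, y, M)
        if P == P_prev:          # assumed stopping rule: neighborhoods are stable
            break
        M = learn_metric(X, y, M, P)
        P_prev = P
    return M, P


# Toy stand-ins that only count how often each step is invoked.
calls = {"nbr": 0, "metric": 0}


def toy_neighborhood(X, y, M):
    calls["nbr"] += 1
    return [[0, 1], [1, 0]]


def toy_metric(X, y, M, P):
    calls["metric"] += 1
    return M


M, P = lnml([[0.0], [1.0]], [0, 0], [[1.0]], toy_neighborhood, toy_metric)
print(calls)  # {'nbr': 2, 'metric': 1}
```

Any metric learner that accepts a fixed target neighborhood matrix and an initial metric could be plugged in as the second callable.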



5.1 Learning the Neighborhood for LMNN 

The optimization problem of LMNN is given by:

$$
\begin{aligned}
\min_{M,\xi} \quad & \sum_{ij} P_{ij} \Big\{ (1-\mu) D_M(x_i, x_j) + \mu \sum_{l} (1 - Y_{il}) \xi_{ijl} \Big\} \\
\text{s.t.} \quad & D_M(x_i, x_l) - D_M(x_i, x_j) \geq 1 - \xi_{ijl} \\
& \xi_{ijl} \geq 0 \\
& M \succeq 0
\end{aligned}
\tag{6}
$$

where the matrix $Y$, $Y_{ij} \in \{0,1\}$, indicates whether the class labels
$y_i$ and $y_j$ are the same ($Y_{ij} = 1$) or different ($Y_{ij} = 0$). The
objective is to minimize the sum of the distances of all instances to their
target neighbors while allowing for some errors; this trade-off is controlled
by the $\mu$ parameter. This is a convex optimization problem that has been
shown to have good generalization ability and can be applied to large datasets.
The original problem formulation corresponds to a fixed parametrization of $P$
where its non-zero values are given by the k nearest neighbors of the same
class.

Coupling the neighborhood learning framework with the LMNN metric learning
method results in the following optimization problem:

$$
\begin{aligned}
\min_{M,P,\xi} \quad & \sum_{ij} P_{ij} \cdot F_{ij}(M, \xi)
 = \min_{M,P,\xi} \sum_{ij} P_{ij} \Big\{ (1-\mu) D_M(x_i, x_j) + \mu \sum_{l} (1 - Y_{il}) \xi_{ijl} \Big\} \\
\text{s.t.} \quad & \sum_{i,j} P_{ij} = K_{av} \cdot n \\
& K_{max} \geq \sum_{j} P_{ij} \geq K_{min} \\
& 1 \geq P_{ij} \geq 0 \\
& D_M(x_i, x_l) - D_M(x_i, x_j) \geq 1 - \xi_{ijl} \\
& \xi_{ijl} \geq 0 \\
& M \succeq 0
\end{aligned}
\tag{7}
$$

We will call this coupling of LNML and LMNN LN-LMNN. The target neighbor
assignment rule of LN-LMNN assigns more target neighbors to instances that
have small distances from their target neighbors and low hinge loss.
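
For a fixed metric, the LN-LMNN quality term of a candidate pair can be evaluated directly. The sketch below assumes that, with $M$ fixed, each slack variable takes its smallest feasible value, i.e. the hinge $\max(0, 1 + D_M(x_i, x_j) - D_M(x_i, x_l))$; the function name and the precomputed distance table `D` are illustrative assumptions.

```python
def lmnn_quality(D, y, i, j, mu=0.5):
    """F_ij = (1 - mu) * D[i][j] + mu * sum of hinge losses over different-class x_l.

    D[i][j] holds the distance D_M(x_i, x_j) under the current metric.
    """
    hinge = sum(max(0.0, 1.0 + D[i][j] - D[i][l])
                for l in range(len(y)) if y[l] != y[i])
    return (1.0 - mu) * D[i][j] + mu * hinge


# Hypothetical distances: x_0 and x_1 share a class, x_2 is an impostor close to x_0.
y = [0, 0, 1]
D = [[0.0, 1.0, 1.5],
     [1.0, 0.0, 4.0],
     [1.5, 4.0, 0.0]]
print(lmnn_quality(D, y, 0, 1))  # 0.75: the impostor violates the margin of x_0
print(lmnn_quality(D, y, 1, 0))  # 0.5: no margin violation for x_1
```

The lower value for the pair seen from $x_1$ marks it as the higher quality relation, so $x_1$ would be favoured when extra target neighbors are handed out.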



5.2 Learning the Neighborhood for MCML 

MCML relies on a data dependent stochastic probability that an instance $x_j$
is selected as the nearest neighbor of an instance $x_i$; this probability is
given by:

$$p_M(j|i) = \frac{e^{-D_M(x_i, x_j)}}{Z_i} = \frac{e^{-D_M(x_i, x_j)}}{\sum_{k \neq i} e^{-D_M(x_i, x_k)}}, \quad i \neq j \tag{8}$$

MCML learns the Mahalanobis metric that minimizes the KL divergence between
this probability distribution and the ideal probability distribution $p_0$
given by:

$$p_0(j|i) = \frac{P_{ij}}{\sum_{k} P_{ik}}, \quad p_0(i|i) = 0 \tag{9}$$

where $P_{ij} = 1$ if instance $x_j$ is the target neighbor of instance $x_i$,
otherwise $P_{ij} = 0$. The optimization problem of MCML is given by:

$$
\begin{aligned}
\min_{M} \quad & \sum_{i} KL[p_0(j|i) \,|\, p_M(j|i)]
 = \min_{M} \sum_{i,j} \frac{P_{ij}}{\sum_{k} P_{ik}} \left( D_M(x_i, x_j) + \log Z_i \right) \\
\text{s.t.} \quad & M \succeq 0
\end{aligned}
\tag{10}
$$

Like LMNN, this is also a convex optimization problem. In the original problem
formulation the ideal distribution is defined based on class labels, i.e.
$P_{ij} = 1$ if instances $x_i$ and $x_j$ share the same class label, otherwise
$P_{ij} = 0$.

The neighborhood learning method cannot directly learn the target neighborhood
for MCML, since the objective function of the latter cannot be rewritten in the
form of the objective function in Problem 3 due to the denominator
$\sum_{k} P_{ik}$. However, if we fix the size of the neighborhood to
$\sum_{k} P_{ik} = K_{av} = K_{min} = K_{max}$ the two methods can be coupled
and the resulting optimization is given by:

$$
\begin{aligned}
\min_{M,P} \quad & \sum_{ij} P_{ij} \cdot F_{ij}(M)
 = \min_{M,P} \sum_{ij} \frac{P_{ij}}{K_{av}} \left( D_M(x_i, x_j) + \log Z_i \right) \\
\text{s.t.} \quad & \sum_{j} P_{ij} = K_{av} \\
& 1 \geq P_{ij} \geq 0 \\
& M \succeq 0
\end{aligned}
$$

We will dub this coupling of LNML and MCML LN-MCML. The original MCML method
follows the global approach in establishing the neighborhood; with LN-MCML we
get a local approach in which the neighborhoods are of fixed size $K_{av}$ for
every instance.
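
The MCML selection probability and the corresponding quality term for a fixed metric can be computed as follows. This is a minimal sketch: the function names and the distance table `D` are assumptions, and the quality term is taken to be $(D_M(x_i, x_j) + \log Z_i) / K_{av}$, which equals $-\log p_M(j|i) / K_{av}$.

```python
import math


def selection_probability(D, i, j):
    """p_M(j|i): soft-max over the negated distances from x_i to all other instances."""
    z_i = sum(math.exp(-D[i][k]) for k in range(len(D)) if k != i)
    return math.exp(-D[i][j]) / z_i


def mcml_quality(D, i, j, k_av):
    """F_ij = (D[i][j] + log Z_i) / K_av for a neighborhood of fixed size K_av."""
    z_i = sum(math.exp(-D[i][k]) for k in range(len(D)) if k != i)
    return (D[i][j] + math.log(z_i)) / k_av


# Hypothetical distance table D[i][j] = D_M(x_i, x_j) for three instances.
D = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]]
print(selection_probability(D, 0, 1) + selection_probability(D, 0, 2))
print(mcml_quality(D, 0, 1, k_av=1), mcml_quality(D, 0, 2, k_av=1))
```

The closer instance gets the higher selection probability and therefore the lower (better) quality value, so it is preferred as a target neighbor.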

6 Experiments 

With the experiments we wish to investigate a number of issues. First, we want
to examine whether learning the target neighborhood relations in the metric
learning process can improve predictive performance over the baseline approach
of metric learning with an a priori established target neighborhood. Second, we
want to acquire an initial understanding of how the parameters $K_{min}$ and
$K_{max}$ relate to the predictive performance. To this end, we will examine
the predictive performance of LN-LMNN with two-fold inner Cross Validation (CV)
to select the appropriate values of $K_{min}$ and $K_{max}$, a method which we
will denote by LN-LMNN(CV), and that of LN-LMNN with a default setting of
$K_{min} = K_{max} = K_{av}$. Finally, we want to see how the method that we
propose compares to other state-of-the-art metric learning methods, namely NCA
and ITML. We include as an additional baseline in our experiments the
performance of the Euclidean metric (EucMetric). We experimented with twelve
different datasets: seven from the UCI machine learning repository, Sonar,
Ionosphere, Iris, Balance, Wine, Letter, Isolet; four text mining datasets,
Function, Alt, Disease and Structure, which were constructed from biological
corpora [7]; and MNIST [8], a handwritten digit recognition problem. A more
detailed description of the datasets is given in Table 1.

Since LMNN is computationally expensive for datasets with a large number of
features we applied principal component analysis (PCA) to retain a limited
number of principal components, following [18]. The datasets to which this was
done were the four text mining datasets, Isolet and MNIST. For the two latter,
173 and 164 principal components were respectively retained, which explain 95%
of the total variance. For the text mining datasets more than 1300 principal
components would have to be retained to explain 95% of the total variance.
Considering the running time constraints, we kept the 300 most important
principal components, which explained 52.45%, 47.57%, 44.30% and 48.16% of the
total variance for Alt, Disease, Function and Structure respectively. We could
experiment with NCA and MCML on the full training datasets only for datasets
with a small number of instances, due to their computational complexity. For
completeness we experimented with NCA on large datasets by undersampling the
training instances, i.e. the learning process only involved 10% of the full
training instances, which was the maximum number we could experiment with for
each dataset. We also applied ITML on both versions of the larger datasets,
i.e. with PCA-based dimensionality reduction and the original ones.

For ITML, we randomly generate for each dataset the default $20c^2$ constraints,
which are bounded respectively by the 5th and 95th percentiles of the
distribution of all available pairwise distances for similar and dissimilar
pairs. The slack variable $\gamma$ is chosen from $\{10^i\}_{i=-4}^{4}$ using
two-fold CV. The default identity matrix is employed as the regularization
matrix. For the different instantiations of the LNML method we took care to
have the same parameter settings for the encapsulated metric learning method
and the respective baseline metric learning. For LN-LMNN, LN-LMNN(CV) and LMNN
the regularization parameter $\mu$ that controls the trade-off between the
distance minimization component and the hinge loss component was set to 0.5
(the default value of LMNN). For LMNN the default number of target neighbors
was used (three). For LN-LMNN, we set $K_{min} = K_{max} = K_{av} = 3$, similar
to LMNN. To explore the effect of a flexible neighborhood, the values of the
$K_{min}$ and $K_{max}$ parameters in LN-LMNN(CV) were selected from the sets
{1, 4, 3} and {2, 5, 3} respectively, while $K_{av}$ was fixed to three.
Similarly for LN-MCML we also set $K_{av} = 3$. The distance metrics for all
methods are initialized to the Euclidean metric. As the classification
algorithm we used 1-Nearest Neighbor.

Table 1. Description of the datasets.

Dataset     Description                    # Samples  # Features  # Classes  # Retained PCA components  % Explained variance
Sonar                                      208        60          2          NA                         NA
Ionosphere                                 351        34          2          NA                         NA
Wine                                       178        13          3          NA                         NA
Iris                                       150        4           3          NA                         NA
Balance                                    625        4           3          NA                         NA
Letter      character recognition          20000      16          26         NA                         NA
Function    sentence classification        3907       2708        2          300                        44.30%
Alt         sentence classification        4157       2112        2          300                        52.45%
Disease     sentence classification        3273       2376        2          300                        47.57%
Structure   sentence classification        3584       2368        2          300                        48.16%
Isolet      spoken character recognition   7797       619         26         173                        95%
MNIST       handwritten digit recognition  70000      784         10         164                        95%

We used 10-fold cross validation for all datasets to estimate classification
accuracy, with the exception of Isolet and MNIST for which the default train
and test split was used. The statistical significance of the differences was
tested with McNemar's test and the significance level was set to 0.05. In order
to get a better understanding of the relative performance of the different
algorithms for a given dataset we used a ranking schema in which an algorithm A
was assigned one point if it was found to have a significantly better accuracy
than another algorithm B, 0.5 points if the two algorithms did not have a
significantly different performance, and zero points if A was found to be
significantly worse than B. The rank of an algorithm for a given dataset is
simply the sum of the points over the different pairwise comparisons. When
comparing $N$ algorithms on a single dataset the highest possible score is
$N - 1$, while if there is no significant difference each algorithm will get
$(N - 1)/2$ points.
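
The scoring scheme just described is easy to state in code. The following Python sketch assumes that the outcomes of the pairwise significance tests are already available in a matrix (the encoding +1 / 0 / -1 and the function name are assumptions for illustration; the statistical test itself is not implemented here).

```python
def ranking_scores(outcome):
    """Score each algorithm from pairwise significance outcomes.

    outcome[a][b] is +1 if a is significantly better than b,
    -1 if a is significantly worse, and 0 if there is no significant difference.
    """
    points = {1: 1.0, 0: 0.5, -1: 0.0}
    n = len(outcome)
    return [sum(points[outcome[a][b]] for b in range(n) if b != a)
            for a in range(n)]


# Hypothetical comparison of three algorithms: A beats C, every other pair is a tie.
outcome = [[0, 0, 1],
           [0, 0, 0],
           [-1, 0, 0]]
print(ranking_scores(outcome))  # [1.5, 1.0, 0.5]
```

The scores on one dataset always sum to $N(N-1)/2$, which gives a quick consistency check on a table of per-dataset scores.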

6.1 Results 

The results are presented in Table 2. Examining whether also learning the
neighborhood improves the predictive performance compared to plain metric
learning, we see that in the case of LN-MCML, and for the five small datasets
for which we have results, learning the neighborhood results in a statistically
significant deterioration of the accuracy in one out of the five datasets
(Balance), while for the remaining four the differences were not statistically
significant. If we now examine LN-LMNN(CV), LN-LMNN and LMNN we see that here
learning the neighborhood does bring a statistically significant improvement.
More precisely, LN-LMNN(CV) and LN-LMNN improve over LMNN respectively in six
(two small and four large) and four (two small and two large) out of the 12
datasets. Moreover, by comparing LN-LMNN(CV) and LN-LMNN, we see that learning
a flexible neighborhood with LN-LMNN(CV) significantly improves the performance
over LN-LMNN on two datasets. The low performance of LN-MCML on the Balance
dataset was intriguing; in order to take a closer look we tried to determine
automatically the appropriate target neighborhood size, $K_{av}$, by selecting
it on the basis of five-fold inner cross validation from the set
$K_{av} \in \{3, 5, 7, 10, 20, 30\}$. The results showed that the default value
of $K_{av}$ was too small for the given dataset, with the average selected size
of the target neighborhood at 29. As a result of the automatic tuning of the
target neighborhood size the predictive performance of LN-MCML jumped to an
accuracy of 93.92%, which represented a significant improvement over the
baseline MCML for the Balance dataset. For the remaining datasets it turned out
that $K_{av} = 3$ was a good default choice. In any case, determining the
appropriate size of the target neighborhood and how that affects the predictive
performance is an issue that we wish to investigate further. In terms of the
total score that the different methods obtain, LN-LMNN(CV) achieves the best
score in both the small and the large datasets. It is followed closely by NCA
in the small datasets and by LN-LMNN in the large datasets.

7 Conclusion and Future Work 

We presented LNML, a general Learning Neighborhood method for Metric Learning
algorithms which couples the metric learning process with the process of
establishing the appropriate target neighborhood for each instance, i.e.
discovering for each instance which same-class instances should be its
neighbors. With the exception of NCA, which cannot be applied on datasets with
many instances, all other metric learning methods, whether they establish a
global or a local target neighborhood, do that prior to the metric learning and
keep the target neighborhood fixed throughout the learning process. The metric
that is learned as a result of the fixed neighborhoods simply reflects these
original relations, which are not necessarily optimal with respect to the
classification problem that one is trying to solve. LNML lifts these
constraints by learning the target neighborhood. We demonstrated it with two
metric learning methods, LMNN and MCML. The experimental results show that
learning the neighborhood can indeed improve the predictive performance.

The target neighborhood matrix $P$ is strongly related to the similarity graphs
which are often used in semi-supervised learning [B], spectral clustering [T5]
and manifold learning [IT]. Most often the similarity graphs in these methods
are constructed in the original space, which nevertheless can be quite
different from the true manifold on which the data lies. These methods could
also profit if one is able to learn the similarity graph instead of basing it
on some prior structure.

Table 2. Accuracy results. The superscripts +, -, = next to the LN-XXXX accuracy
indicate the result of the McNemar's statistical test of its comparison to the
accuracy of XXXX and denote respectively a significant win, loss or no
difference for LN-XXXX. Similarly, the superscripts +, -, = next to the
LN-LMNN(CV) accuracy indicate the result of its comparison to the accuracies of
LMNN and LN-LMNN. The bold entries for each dataset have no significant
difference from the best accuracy for that dataset. The number in parentheses
indicates the score of the respective algorithm for the given dataset based on
the pairwise comparisons.

(a) Small datasets

Dataset      MCML         LN-MCML       LMNN         LN-LMNN       LN-LMNN(CV)    EucMetric    NCA          ITML
Sonar        82.69 (3.5)  84.62 (3.5)=  81.25 (3.5)  81.25 (3.5)=  83.17 (3.5)==  80.77 (3.5)  81.73 (3.5)  82.69 (3.5)
Ionosphere   88.03 (3.0)  88.89 (3.5)=  89.17 (3.5)  87.75 (3.0)=  92.02 (5.5)=+  86.32 (3.0)  88.60 (3.5)  87.75 (3.0)
Wine         91.57 (3.0)  96.07 (4.0)=  94.38 (3.0)  97.75 (5.5)+  97.75 (5.5)+=  76.97 (0.0)  91.57 (3.0)  94.94 (4.0)
Iris         98.00 (4.5)  96.00 (3.5)=  96.00 (3.5)  94.00 (3.0)=  94.00 (3.0)==  96.00 (3.5)  96.00 (3.5)  96.00 (3.5)
Balance      91.20 (5.0)  78.08 (1.0)-  78.56 (1.0)  89.12 (4.5)+  89.28 (4.5)+=  78.72 (1.0)  96.32 (7.0)  87.84 (4.0)
Total Score  19.0         15.5          14.5         19.5          22.0           11.0         20.5         18.0

(b) Large datasets

Dataset      PCA+LMNN     PCA+LN-LMNN   PCA+LN-LMNN(CV)  EucMetric    PCA+EucMetric  PCA+NCA      ITML         PCA+ITML
Letter       96.86 (5.0)  97.71 (6.5)+  97.64 (6.5)+=    96.02 (0.5)  96.02 (0.5)    96.48 (3.0)  96.39 (3.0)  96.39 (3.0)
Function     76.30 (2.5)  76.73 (2.5)=  78.91 (6.0)++    78.73 (6.0)  76.48 (2.5)    72.36 (0.0)  78.73 (6.0)  76.45 (2.5)
Alt          83.98 (5.0)  84.92 (6.5)+  85.37 (6.5)+=    68.51 (0.5)  71.33 (2.0)    78.54 (4.0)  68.49 (0.5)  72.53 (3.0)
Disease      80.23 (4.0)  80.14 (4.0)=  80.66 (4.0)==    80.60 (4.0)  80.23 (4.0)    73.59 (0.0)  80.60 (4.0)  80.14 (4.0)
Structure    77.87 (4.5)  78.83 (6.0)=  79.37 (6.5)+=    75.82 (1.5)  77.00 (4.0)    71.93 (0.0)  75.79 (1.5)  77.06 (4.0)
Isolet       95.96 (6.0)  95.06 (6.0)=  95.06 (6.0)==    88.58 (1.5)  88.33 (1.5)    85.63 (0.0)  92.05 (3.5)  91.08 (3.5)
MNIST        97.66 (6.0)  97.66 (6.0)=  97.73 (6.0)==    96.91 (2.0)  96.97 (2.0)    96.58 (1.5)  96.93 (1.5)  97.09 (3.0)
Total Score  33           37.5          41.5             16           16.5           8.5          20           23